\documentclass[review]{elsarticle}
\usepackage{float}
\usepackage{caption}
\usepackage{rotating} 
\captionsetup{labelsep=space,justification=justified,singlelinecheck=off}
\usepackage{lineno,hyperref}
\modulolinenumbers[5]

\journal{Journal of \LaTeX\ Templates}

%%%%%%%%%%%%%%%%%%%%%%%
%% Elsevier bibliography styles
%%%%%%%%%%%%%%%%%%%%%%%
%% To change the style, put a % in front of the second line of the current style and
%% remove the % from the second line of the style you would like to use.
%%%%%%%%%%%%%%%%%%%%%%%

%% Numbered
%\bibliographystyle{model1-num-names}

%% Numbered without titles
%\bibliographystyle{model1a-num-names}

%% Harvard
%\bibliographystyle{model2-names.bst}\biboptions{authoryear}

%% Vancouver numbered
%\usepackage{numcompress}\bibliographystyle{model3-num-names}

%% Vancouver name/year
%\usepackage{numcompress}\bibliographystyle{model4-names}\biboptions{authoryear}

%% APA style
%\bibliographystyle{model5-names}\biboptions{authoryear}

%% AMA style
%\usepackage{numcompress}\bibliographystyle{model6-num-names}

%% `Elsevier LaTeX' style
\bibliographystyle{elsarticle-num}
%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

\begin{frontmatter}

\title{Improving coreference resolution in biomedical literature}
%\tnotetext[mytitlenote]{Fully documented templates are available in the elsarticle package on \href{http://www.ctan.org/tex-archive/macros/latex/contrib/elsarticle}{CTAN}.}

%% Group authors per affiliation:
%\author{Elsevier\fnref{myfootnote}}
%\address{Radarweg 29, Amsterdam}
%\fntext[myfootnote]{Since 1880.}

%% or include affiliations in footnotes:
\author[mymainaddress,mysecondaryaddress]{Miji Choi}
\ead{jooc1@student.unimelb.edu.au}
\author[mymainaddress]{Justin Zobel}
\ead{jzobel@unimelb.edu.au}
\author[mymainaddress]{Karin Verspoor\corref{mycorrespondingauthor}}
\cortext[mycorrespondingauthor]{Corresponding author}
\ead{karin.verspoor@unimelb.edu.au}



\address[mymainaddress]{Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia}
\address[mysecondaryaddress]{National ICT Australia (NICTA) Victoria Research Laboratory}

\begin{abstract} 
Natural language processing (NLP) technologies support the extraction of useful information, such as biomolecular interactions and events, from biomedical literature. Many strategies and technologies for information extraction have been developed, and several studies have shown that coreference expressions are one of the main obstacles to event extraction in the biomedical domain. 
In this paper, we investigate whether a state-of-the-art coreference resolution system for the general domain can be adapted to the biomedical domain, by evaluating it on biomedical texts and comparing it with a biomedical domain-specific system. Our evaluation shows that the approaches used in the general system are applicable to the biomedical domain, performing better than the domain-specific system in several respects, in spite of poor overall performance on biomedical literature. In addition, we identify where the approaches used in the two systems need improvement, based on an analysis of the evaluation results.

\end{abstract}

\begin{keyword}
Coreference resolution \sep Domain adaptation
\end{keyword}

\end{frontmatter}

%\linenumbers

\section{Introduction} 
Text-mining technologies play an important role in processing the scientific literature, which is growing exponentially in today's life sciences. They make it possible to identify biomedical concepts such as genes, proteins, drugs, and diseases, to extract relations (events), and then to discover hidden links between those concepts. Recently, researchers in biomedical text mining have begun to shift their interest from biomolecule named-entity recognition (NER) toward the extraction of biomolecular events and networks. Biomolecular networks composed of events are widely used in biomedical and pharmaceutical research to understand complex biological processes that could explain specific health conditions: gene regulatory and signal transduction networks describe processes involving genes and protein-protein interactions, and metabolic pathways explain functional behaviours between proteins and chemical compounds. Thus, automatic extraction of interactions between biomolecular entities from biomedical literature can provide a better understanding of complicated biomedical mechanisms. 
However, coreference expressions such as pronominal anaphors and appositions, which are common in biomedical literature, are one of the major obstacles to the extraction of event information and reduce the performance of information extraction systems \cite{kim2011overview,miwa2010event,kim2011extracting}. Biomedical texts make abundant use of anaphoric expressions to refer to entities previously mentioned in the same text, and interactions and events are often described across successive sentences, as shown in the example passage below.  \\

\textit{...\textbf{Tepoxalin} is a new dual cyclooxygenase/5-lipoxygenase anti-inflammatory compound currently under clinical investigation. \textbf{It} has been shown to possess anti-inflammatory activity in a variety of animal models and more recently to inhibit IL-2 induced signal transduction. The current study was conducted to evaluate the cytokine modulating activity of tepoxalin and the role of iron in these effects. In human peripheral blood mononuclear cells (PBMC) stimulated with OKT3/PMA, \textbf{tepoxalin} inhibited lymphocyte proliferation with an IC50 of 6 microM. Additionally, \textbf{it} inhibited the production of LTB4 (IC50 = 0.5 microM) and the cytokines IL-2, IL-6 and TNF alpha (IC50 = 10-12 microM)...}\\

The pronoun \textit{it} in the last sentence of the example refers to \textit{tepoxalin} in the previous sentence, but the event \textit{inhibited} (\textit{it}[\textit{tepoxalin}], the production of LTB4 and the cytokines IL-2, IL-6 and TNF alpha) may be missed by event extraction systems.

There has been a community-wide effort to address coreference in biomedical literature as part of the BioNLP shared task series in 2011 and 2013. The Protein Coreference task, which aims to identify anaphoric coreference links involving proteins and genes, was organised at the BioNLP shared task 2011 \cite{kim2011overview}. Only six research teams participated in the task, and four of their systems produced meaningful performance. Performance was evaluated in two modes: detection of coreference relations involving protein entities (Protein coreference mode), and detection of coreference relations regardless of whether they refer to specific protein entities (Surface coreference mode). Each system performed better in the Protein coreference mode than in the Surface coreference mode, but overall performance was quite low. The best-performing system achieved an F-score of 34\%, with 73\% precision and 22\% recall, in the Protein coreference mode \cite{kim2011taming}. The Coreference Resolution task was incorporated into the Genia Event Extraction shared task at BioNLP 2013, but no participating team addressed it \cite{kim2013genia}.

In this paper, we analyse the characteristics of coreference expressions in the biomedical domain using the training dataset provided for the Protein Coreference task at BioNLP 2011. We evaluate a state-of-the-art general-domain coreference resolution system on biomedical texts, compare it with a biomedical domain-specific system, and identify opportunities to improve system performance in the biomedical domain.

\section{Materials and Methods}
In this research, we evaluated Stanford CoreNLP \cite{lee2011stanford}, an unsupervised sieve-like framework that was the best-performing system at the CoNLL 2011 shared task, on biomedical literature, and compared it with the Turku Event Extraction System (TEES) \cite{bjorne2011generalizing}. TEES is a biomedical domain-specific event extraction system built with a support vector machine approach that includes a coreference resolution module; we used its trained model CO11 for this evaluation. The training dataset provided for the Protein Coreference task at BioNLP 2011 \cite{kim2011overview} served as the gold standard corpus for evaluating both systems. The gold standard corpus is annotated over 800 PubMed journal abstracts and includes 2,313 coreference relations, which are pairs of anaphoric mentions and their antecedents related to proteins and genes. Anaphors in the corpus are classified into several types, as shown in Table \ref{tab: t1}. Relative pronouns such as \textit{which}, \textit{that}, or \textit{where} account for 50\% of the anaphors in the gold standard corpus, and general pronouns such as possessive, personal, and demonstrative pronouns follow with 33\%. There are also 346 mentions (15\%) that begin with \textit{the}, \textit{this}, or \textit{these}, such as \textit{the protein}, \textit{this gene}, or \textit{these complexes}, and 22 proper nouns, which are specific protein names such as \textit{SP-A}, \textit{NF-kappa B}, and \textit{DC}. 

\begin{table}
\caption{Statistics of anaphors in the gold standard annotation}
\label{tab: t1}
\begin{tabular} { l c p{5cm} }

\hline
\textbf{Type} 									&\textbf{Number} 	& \textbf{Example} \\
\hline 
Relative pronoun								& {1,164 (50\%)} 	& \textit{which, that, where} \\
Pronoun	 									& {764 (33\%)}		&  \\
\hspace{4mm}\textit{Possessive pronoun} 		& 423				& \textit{its, their} \\
\hspace{4mm}\textit{Demonstrative pronoun} 	& 142				& \textit{this, that, these} \\
\hspace{4mm}\textit{Personal pronoun} 			& 180				& \textit{it, they, them} \\
\hspace{4mm}\textit{Reflexive pronoun} 			& 13				& \textit{itself, themselves} \\
\hspace{4mm}\textit{Other pronoun} 			& 6					& \textit{both, other, either} \\
Definite noun phrase							& {346 (15\%)} 		& \textit{the protein, these genes} \\
Indefinite noun phrase							& {11 (0.5\%)} 		& \textit{a nuclear factor} \\
Proper noun									& {22 (1\%)} 		& \textit{SP-A, NF-kappa B, DC} \\
Unclassified										& {6} 				& \textit{most lymphokine genes} \\
\hline 
\textbf{Total}									& \textbf{2,313} 	& \\
\hline

\end{tabular}
\end{table}


Table \ref{tab: t2} describes characteristics of the coreference relations in the gold standard corpus, with examples. Among the 2,313 coreference pairs, 560 involve protein and gene entities, and 270 are complex structures including one or more coordinating conjunctions such as \textit{and} and \textit{or}. In 389 pairs the anaphor and antecedent corefer across sentences, and in 43 pairs the anaphor and antecedent are identical. Those identical pairs are mainly definite noun phrases (NPs). Of the 385 coreference pairs whose anaphors are definite NPs, indefinite NPs, proper nouns, or unclassified mentions, 254 share a head word between anaphor and antecedent, for example \textit{this receptor} (anaphor) and \textit{the p75 tumor necrosis factor receptor} (antecedent), while the remaining 131 refer to specific protein or gene names.


\begin{table} 
\caption{Characteristics of coreference relations with analysis of antecedents}
\label{tab: t2}
\begin{tabular} { p{2.9cm} c p{8cm} }

\hline
\textbf{Type} 						&\textbf{Number} 	& \textbf{Example} \\
\hline 
Including protein	& 560	& We have analyzed the expression of \textbf{\textit{IL-2Ralpha, c-myc, and pim-1 genes}} in anti-CD3-activated human T lymphocytes. The induction of \textbf{\textit{these genes}} is associated with... [PMID-10068671] \\
\hline
Including coordinating conjunction	& 217	& Activation of T lymphocytes to produce cytokines is regulated by the counterbalance of \textbf{\textit{protein-tyrosine kinases and protein-tyrosine phosphatases}}, many of \textbf{\textit{which}} have a high degree of substrate... [PMID-10206983] \\
\hline
Cross sentence						& 389	& We found that \textbf{\textit{NFATx1}} DNA binding activity and interaction with AP-1 polypeptides ...suggesting the presence of intrinsic transcriptional activation motifs in both regions. We also identified a potent inhibitory sequence within \textbf{\textit{its}} N-terminal domain... [PMID-9121455]\\
\hline
Identical relation					& 43	& The level of mRNA expression of \textbf{\textit{the NM23 gene}} is significantly lower in cell lines ...Moreover, cell lines derived from tumours of patients with a disease-free survival of more than 24 months (24-58 months) express \textbf{\textit{the NM23 gene}} at higher levels... [PMID-7909963]\\
\hline
Head-word match					& 254	& Despite overwhelming evidence that enhanced production of \textbf{\textit{the p75 tumor necrosis factor receptor (p75TNF-R)}} accompanies development of specific human inflammatory pathologies such as multi-organ failure during sepsis, inflammatory liver disease, pancreatitis, respiratory distress syndrome, or AIDS, the function of \textbf{\textit{this receptor}} remains... [PMID-9763613]\\
\hline

\end{tabular}
\end{table}


\section{Results}
The performance of each system in identifying coreference mentions and determining coreference relations is compared in Table \ref{tab: t3}. Overall, the Stanford system achieved low performance, with F-scores of 12\% and 2\% for coreference mentions and relations respectively, while the TEES system performed better, with F-scores of 69\% and 37\% at the mention and relation levels respectively. For both systems, far fewer exact matches were obtained at the relation level than at the mention level: the number of exact matches fell from 1,006 to 112 for the Stanford system, and from 2,466 to 564 for the TEES system. 

\begin{table} [H]
\caption{Performance on identification of coreference mentions and coreference relations by the Stanford system and the TEES system, evaluated on the training dataset of the BioNLP-11 Protein Coreference shared task}
\label{tab: t3}
\begin{tabular} { l c c c c }
\hline
& \multicolumn{2}{c}{Stanford}  	      &\multicolumn{2}{c}{TEES}        \\
\hline
& Mentions & Relations & Mentions & Relations \\
\hline
Gold annotation & 4,367 & 2,313 & 4,367 & 2,313 \\
\hline
System detected & 12,848 & 7,387 & 2,796 & 707 \\
\hline
Exact match & 1,006 & 112& 2,466 & 564 \\
\hline
Precision & 0.08 & 0.02 & 0.88 & 0.80 \\
\hline
Recall & 0.23 & 0.05 & 0.56 & 0.24 \\
\hline
F-score &0.12 & 0.02 & 0.69 & 0.37 \\
\hline

\end{tabular}
\end{table}
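
As a check on Table \ref{tab: t3}, the scores can be reproduced from the counts, assuming the standard definitions (which are not stated explicitly in the text) of precision as exact matches over system-detected items, recall as exact matches over gold annotations, and F-score as their harmonic mean. For the mention-level results of the Stanford system:
\[
P = \frac{1{,}006}{12{,}848} \approx 0.08, \qquad
R = \frac{1{,}006}{4{,}367} \approx 0.23, \qquad
F = \frac{2PR}{P+R} \approx 0.12 .
\]
Under the same assumption, the relation-level results of the TEES system give $P = 564/707 \approx 0.80$, $R = 564/2{,}313 \approx 0.24$, and $F \approx 0.37$.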

The Stanford system detected far more coreference mentions and relations than are annotated in the gold standard corpus. Because the TEES system is biomedical domain-specific, it achieved higher precision, with 88\% for mention identification and 80\% for relation determination, but it identified fewer mentions and relations, which reduced its recall. We explore the factors behind the low performance in determining coreference relations, analysing each system by type of coreference mention and by characteristics of coreference relations in Figure \ref{fig:figure1}.

\begin{sidewaysfigure}
\centering
\scalebox{0.5}
{\includegraphics{Evaluation_original.png}}
\caption{Analysis of the performance of the Stanford and TEES systems compared with the gold standard corpus (TP: true positive; FP: false positive; CC: coordinating conjunction)}
\label{fig:figure1}
\end{sidewaysfigure}


\subsection* {Identical relation}
The Stanford system detected 2,579 identical coreference relations, in which an anaphor and its antecedent are identical; these account for 35\% of the 7,387 relations the system detected. Of these 2,579 relations, only 9 exactly matched the annotated corpus, and the rest were false positives. This is due to the difference in the scope of coreference resolution between the Stanford system and the annotated corpus: the Stanford system aims to identify all mentions referring to the same entity, rather than only coreference links involving anaphoric mentions.

\subsection*  {Lack of domain-specific knowledge}
The annotated corpus includes 560 coreference relations that specifically involve protein and gene entities. For those relations, the Stanford system identified only 38 true positives and 1,917 false positives, whereas the TEES system identified 155 true positives. This is because the Stanford system resolves coreference mentions using syntactic and discourse information, while the TEES system uses a biomedical domain-specific NER component for protein and gene mentions. This suggests that domain-specific knowledge is an important factor in coreference resolution performance. 
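
To make the gap concrete, these counts can be expressed as rates over the 560 gold relations involving protein and gene entities. This is a rough calculation, assuming that the true and false positives reported for this category are counted over the same set of relations:
\[
R_{\mathrm{Stanford}} = \frac{38}{560} \approx 0.07, \qquad
R_{\mathrm{TEES}} = \frac{155}{560} \approx 0.28, \qquad
P_{\mathrm{Stanford}} = \frac{38}{38 + 1{,}917} \approx 0.02 .
\]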

\subsection* {Ordering issue in determining antecedent}
The Stanford system produced a large number of false-positive coreference relations involving general pronouns. Table \ref{tab: t4} compares the identification of coreference relations at the anaphor-match level with identification at the pair-match level, where both the anaphor and its antecedent must match. Although the system was able to recognise anaphoric mentions that match the annotated corpus, it mostly failed to determine the correct antecedents of those mentions. One of the main causes of incorrect antecedents is an ordering issue: the Stanford system tends to select an antecedent by preferring closer candidates. 

\begin{table} [H]
\caption{Coreference relations involving general pronouns identified by the Stanford system, compared at the anaphor-match level and at the pair-match level of anaphors and antecedents}
\label{tab: t4}
\begin{tabular}{ l  c c c c c c c c}
\hline
						& it 	& its 	& this & they & their & them & those & itself \\
\hline
Gold coreference pairs 	& 113 & 281 & 3 	& 55 	& 142 	  & 12 	   & 40	      & 10 \\ 

System detected 	 	& 140  & 282  & 12 	& 44 	& 115 	  & 6 	   & 0	      & 16 \\ 

Anaphor match only 	 & 101 & 273 & 0 	& 42 	& 109 	  & 5 	   & 0	      & 10 \\ 

Coreference pair match 	 & 25   & 28 	& 0 	& 4	& 10 	  & 1 	   & 0	      & 1 \\ 
\hline

\end{tabular}
\end{table}
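
Summing the columns of Table \ref{tab: t4} gives a rough overall picture. Assuming that the anaphor-match row counts every detected pair whose anaphor matches the gold standard, including those whose antecedent also matches, the anaphor-level total is $101+273+0+42+109+5+0+10 = 540$ and the pair-level total is $25+28+0+4+10+1+0+1 = 69$, so
\[
\frac{69}{540} \approx 0.13 ,
\]
that is, under this reading only about 13\% of the pairs with a correctly recognised anaphor are also linked to the correct antecedent.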

\subsection* {Limited syntactic parsing information}
Figure \ref{fig:figure1}-E shows that the Stanford system covers very few relative pronouns: it identified only 3 coreference relations involving relative pronouns, of which one is a true positive and the other two are false positives. This is because the Stanford system considers only mentions tagged as NP to be anaphoric mentions.

\subsection* {Internal-sentence boundary}
The TEES system is limited to identifying coreference relations in which the anaphor and antecedent corefer within a single sentence, which is one of the main factors in its low performance. In particular, definite noun phrases mostly link to their antecedents across sentences, as shown in Figure \ref{fig:figure1}, and coreference relations involving definite noun phrases were ignored by the TEES system.

\subsection* {Disregard of definite NP}
Among the 79 definite noun phrases involved in internal-sentence relations, the TEES system recognised 10 mentions, of which 7 were true positives. The system tends to identify anaphoric mentions such as relative pronouns and general pronouns rather than definite noun phrases.


\section{Discussion}
\subsection {Different scope in coreference resolution task}

As shown in Table \ref{tab: t3}, the Stanford system performs poorly on coreference resolution in biomedical text. One cause of the poor performance is the different scope of the resolution task. As described in Section 2, the gold standard corpus consists of pronominal anaphoric mentions, so the task of coreference resolution in the biomedical domain is mainly to determine correct antecedents for pronouns or definite NPs, particularly those involving biomedical entities. By contrast, the Stanford system, which targets the newswire domain, identifies all mentions referring to the same entity in a text, including but not limited to anaphoric mentions.

\paragraph*{Non-pronominal mention detection}
The Stanford system produced far more coreference pairs (7,387) than the 2,313 pairs in the gold standard corpus. Of the 7,387 detected pairs, 2,604 involve pronominal mentions, including definite NPs, while the remaining 4,783 consist of proper nouns (protein names) such as \textit{gamma B, NF-kappa B}, common nouns such as \textit{cells, transcription factors}, numbers such as \textit{1, -1, 0.1}, operator symbols such as \textit{+, -, /}, roman numerals such as \textit{I, II, III}, and experimental values such as \textit{p$<$0.01, 2h}. These arbitrary terms are identified by the mention detection module of the Stanford system.

\paragraph*{Identical coreference pairs}
In addition to non-pronominal mentions, the Stanford system identified 2,579 identical coreference pairs, in which an anaphor and its antecedent are the same, while the gold standard corpus includes only 43 identical pairs. The Stanford system determines antecedents for anaphoric mentions with a collection of rules that process different features, such as (exact) string match, head match, and pronoun match, and the exact string match rule is applied before the others. 

\paragraph*{Distribution of antecedent length}
For the 2,604 coreference pairs involving pronominal anaphors, Figure \ref{fig:figure2} shows that the Stanford system tends to select longer antecedents: long antecedents (more than 7 words) account for 25\%, while antecedents of 1--4 words are dominant in the gold standard corpus. 

\begin{figure} [H]
    \centering
    \includegraphics[width=4.0in]{fig2.png}
    \caption{Distribution of antecedent length in the gold standard corpus and in the output of the Stanford system}
    \label{fig:figure2}
\end{figure}

\subsection {Error analysis}

For coreference pairs involving protein and gene entities identified by the Stanford system, Table \ref{tab: t5} classifies 100 randomly selected false positives into four error types: System Error, Partial-match, Partial-match in previous sentences, and Exact-match in previous sentences.

\begin{table} [H]
\caption{Classification of incorrect pairs involving protein names among the false positives produced by the Stanford system}
\label{tab: t5}
\begin{tabular} { l  p{1.1cm} p{1.1cm} p{2.1cm} p{2.1cm} p{1cm}}
\hline
Anaphors					& System Error 	& Partial match & Partial match in prev. sent. & Exact match in prev. sent. &Total\\	
\hline
\textit{It}			 		& {5} 			& {4} & {4} & {1} & {14}\\ 
\textit{They}			 	& {4} 			& {2} & {0} & {0} & {6}\\ 
\textit{Its}			 	& {16} 			& {33} & {10} & {1} & {60}\\ 
\textit{Their}			 	& {3} 			& {6} & {0} & {0} & {9}\\ 
\textit{Themselves}		& {0} 			& {1} & {0} & {0} & {1}\\ 
\textit{This gene}		& {2} 			& {0} & {1} & {0} & {3}\\ 
\textit{This protein}		& {2} 			& {1} & {0} & {0} & {3}\\ 
\textit{The gene}			& {0} 			& {0} & {1} & {0} & {1}\\  
\textit{The protein}		& {0} 			& {0} & {1} & {0} & {1}\\ 
\textit{These genes}		& {0} 			& {0} & {1} & {0} & {1}\\ 
\textit{These proteins}	& {0} 			& {1} & {0} & {0} & {1}\\ 
\hline
\textbf{Total}				& {32} 			& {48} & {18} & {2} & {100}\\
\hline

\end{tabular}
\end{table}

According to the error analysis, 48\% of the sampled false positives partially match the gold standard corpus, and system errors, in which a completely wrong antecedent is determined, account for 32\%. The remaining 20\% arise because the system identifies antecedents mentioned in previous sentences, comprising exact matches (2\%) and partial matches (18\%).



%\begin{table} [H]
%\caption{Comparison of antecedent word-length for anaphors between the gold annotated dataset and results of the Stanford system}
%\label{tab: t6}
%\begin{tabular} { l  c c }
%\hline
%							& Gold standard 	& Stanford system\\	
%\hline
%1-5 words			 	& {80\%} 			& {65\%} \\ 
%6-9 words	 	 		& {15\%}  			& {19\%}  \\ 
%over 10words 			& {5\%} 			& {16\%} \\ 
%\hline

%\end{tabular}
%\end{table}










\section{Conclusion}



\section*{References}

\bibliography{mybibfile}

\end{document}